Submitted by: Susan Bataju
For Lab 3, the Stellar Classification Dataset was chosen [1][2]. The light from faraway objects is red-shifted because the universe is expanding and the wavelength of light is stretched. Quasars are mainly found at the centers of galaxies and are among the most energetic stellar objects; they were also more common in the early universe. Objects that formed in the distant past are more redshifted than relatively young objects, so quasars are vital for the study of the early universe.
The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column that identifies it as a star, galaxy, or quasar. The following features are present in the dataset:
obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
alpha = Right Ascension angle (at J2000 epoch)
delta = Declination angle (at J2000 epoch)
u = Ultraviolet filter in the photometric system
g = Green filter in the photometric system
r = Red filter in the photometric system
i = Near Infrared filter in the photometric system
z = Infrared filter in the photometric system
run_ID = Run Number used to identify the specific scan
rerun_ID = Rerun Number to specify how the image was processed
cam_col = Camera column to identify the scanline within the run
field_ID = Field number to identify each field
spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
class = object class (galaxy, star, or quasar)
redshift = redshift value based on the increase in wavelength
plate = plate ID, identifies each plate in SDSS
MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
The classification technique studied in this report is most useful for astrophysicists studying the early universe. The classification task is to classify each object into one of three categories: galaxy, star, or quasar. The model would be used in offline analysis to separate galaxies, stars, and quasars as we study the data gathered through January 2021 (the latest release).
The idea for PCA was taken from https://www.kaggle.com/code/wessamwalid/stellar-classification-sdss17-4-ml-models#Missing-Value-Analysis.
import sklearn
from sklearn.datasets import load_iris
import numpy as np
from sklearn.metrics import accuracy_score
from scipy.special import expit
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder ,LabelEncoder
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import copy
from copy import deepcopy
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import optuna
import xgboost as xgb
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score,confusion_matrix,roc_curve,auc
from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice
import plotly.io as pio
pio.renderers.default = 'iframe' # or 'notebook' or 'colab' or 'jupyterlab'
# ds = load_iris()
# X = ds.data
# y = (ds.target>1).astype(int) # make problem binary
def plot_sigbkg(class_,name,y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat):
# plt.figure(figsize=(6, 4))
plt.hist(yhat[:, class_][y_test1==class_],label='test sig',density=True,alpha=0.5)
plt.hist(yhat[:, class_][y_test1!=class_],label='test bkg',density=True,alpha=0.5)
plt.hist(ytrainhat[:, class_][y_train==class_],label='train sig',density=True,histtype='step')
plt.hist(ytrainhat[:, class_][y_train!=class_],label='train bkg',density=True,histtype='step')
counts,bin_edges = np.histogram(yvalhat[:, class_][y_train1==class_],density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.plot(bin_centers, counts,marker="o",linestyle="None",label="val sig")
counts,bin_edges = np.histogram(yvalhat[:, class_][y_train1!=class_],density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.plot(bin_centers, counts,marker="*",linestyle="None",label="val bkg")
plt.title(name)
plt.legend()
plt.tight_layout()
# plt.show()
def plot_roc(class_,name,y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat):
# plt.figure(figsize=(6,4))
roc1 = roc_curve(1*(y_test1==class_) , yhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'b',label=f'test (area = {auc(fpr,tpr)*100:.1f})%')
roc1 = roc_curve(1*(y_train==class_) , ytrainhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'r',label=f'train (area = {auc(fpr,tpr)*100:.1f})%')
roc1 = roc_curve(1*(y_train1==class_) , yvalhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'g',label=f'val (area = {auc(fpr,tpr)*100:.1f})%')
plt.legend()
plt.title(name)
# plt.show()
# https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17
# https://www.kaggle.com/code/wessamwalid/stellar-classification-sdss17-4-ml-models#Missing-Value-Analysis
# load the data
df = pd.read_csv("star_classification.csv")
df.head()
df['class'].unique() # three classes
le = LabelEncoder() # label encode the class column
le.fit(df['class'].unique())
target = le.transform(df['class'])
df['target'] = target
df.drop(['class'],inplace=True,axis=1)
The classes are encoded as shown, and the total number of instances of each class is printed below.
print(len(target[target==0]), '--> 0 GALAXY')
print(len(target[target==1]), '--> 1 QSO')
print(len(target[target==2]), '--> 2 STAR')
fig, ax = plt.subplots()
bars=plt.bar(le.classes_, [len(target[target==i]) for i in range(len(le.classes_)) ])
for index,data in enumerate([len(target[target==i]) for i in range(len(le.classes_)) ]):
plt.text(x=index,y=10000,s=f'{data/len(target)*100:.0f}%',c='white',va='center', fontweight='bold')
plt.title("Class")
plt.show()
The data will be split into training, validation, and test sets. To deal with the imbalanced classes, balanced class weights are computed below.
from sklearn.utils.class_weight import compute_class_weight
# https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
compute_class_weight('balanced', classes=np.unique(target), y=target)
# If ‘balanced’, class weights will be given by n_samples / (n_classes * np.bincount(y))
# also can be calculated like so.
print([len(target)/(3*i) for i in np.bincount(target) ])
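As a sanity check, the balanced-weight formula n_samples / (n_classes * np.bincount(y)) can be verified against scikit-learn on a tiny set of hypothetical labels (not the SDSS data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy imbalanced labels: 6 of class 0, 3 of class 1, 1 of class 2
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)

# manual formula: n_samples / (n_classes * count per class)
manual = len(y_demo) / (len(np.unique(y_demo)) * np.bincount(y_demo))

# scikit-learn's computation of the same weights
sk = compute_class_weight('balanced', classes=np.unique(y_demo), y=y_demo)
```

The rarest class gets the largest weight, which is what lets the loss pay equal attention to each class.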
df.describe()
print(df.info())
# missing value
df.isnull().sum()
There are no missing values and all the features are numerical. A histogram and a boxplot for each feature are plotted below.
for var in df:
# print(var)
plt.subplot(1,2,1)
plt.hist(df[var])
plt.title(var)
plt.subplot(1,2,2)
plt.boxplot(df[var])
plt.show()
sns.pairplot(df, diag_kind = "kde")
plt.show()
plt.figure(figsize = (14,10))
sns.heatmap(df.corr(), annot = True, fmt = ".1f", linewidths = .7)
plt.show()
Let's also drop the ID columns obj_ID and run_ID, as well as field_ID, cam_col, rerun_ID, and fiber_ID, which appear to be variables related to the telescope and survey bookkeeping rather than to the objects themselves.
df = df.drop(['obj_ID','run_ID','cam_col','rerun_ID','field_ID','fiber_ID'], axis = 1)
df_o = deepcopy(df)
df.head()
Let's now standardize the dataset for PCA.
y= df.pop('target')
X=deepcopy(df)
var_used = X.columns
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X),columns=var_used)
len(var_used)
pca = PCA()
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data = principalComponents
, columns = [f'principal component {i}' for i in range(len(var_used))])
finalDf = pd.concat([principalDf, pd.Series(y,name='target')], axis = 1)
finalDf
pca.explained_variance_ratio_
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plt.figure(figsize=[10,8])
plt.bar([f'pc {i}' for i in range(len(var_used))],explained_var,label='individual explained variance')
plt.plot([f'pc {i}' for i in range(len(var_used))],cum_var_exp,'or--',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel("Principal components")
for i,j in zip([f'pc {i}' for i in range(len(var_used))],cum_var_exp):
plt.text(i,j+0.01,f'{j:.2f}')
plt.legend()
plt.show()
It can be seen that the first five principal components contain 99% of the variance, so we will use the first five PCA variables for the rest of the lab. The data is first split into two groups, with 80% as training data; the remaining 20% is split again into a validation set (16% of the total data) and a test set (4% of the total data).
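The 99% cutoff described above can also be chosen programmatically rather than read off the plot. A minimal sketch on synthetic data (the array and variable names here are hypothetical, not from the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the 11 scaled features: rank-3 signal plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 11))
X_demo += 0.01 * rng.normal(size=X_demo.shape)

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))
cum = np.cumsum(pca_demo.explained_variance_ratio_)
# smallest number of components whose cumulative variance reaches 99%
n_keep = int(np.searchsorted(cum, 0.99) + 1)
```

Applied to the real `principalDf`, the same `searchsorted` trick would recover the five-component choice directly from `pca.explained_variance_ratio_`.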
x_train, x_test, y_train, y_test = train_test_split(principalDf[principalDf.columns[:5]], y, test_size = 0.20, random_state = 42)
The training set has 80,000 events.
print("x_train: {}".format(x_train.shape))
print("x_test: {}".format(x_test.shape))
print("y_train: {}".format(y_train.shape))
print("y_test: {}".format(y_test.shape))
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_test, y_test, test_size = 0.20, random_state = 42)
The validation set has 16,000 events and the test set has 4,000 events.
print("x_train1: {}".format(x_train1.shape))
print("x_test1: {}".format(x_test1.shape))
print("y_train1: {}".format(y_train1.shape))
print("y_test1: {}".format(y_test1.shape))
sampled_df = pd.concat([ finalDf[finalDf.target==0].sample(20000,random_state=4), finalDf[finalDf.target==1],finalDf[finalDf.target==2]])
sampled_df
fig, ax = plt.subplots()
bars=plt.bar(le.classes_, [len(sampled_df[sampled_df.target==i]) for i in range(len(le.classes_)) ])
for index,data in enumerate([len(sampled_df[sampled_df.target==i]) for i in range(len(le.classes_)) ]):
plt.text(x=index,y=10000,s=f'{data/len(sampled_df)*100:.0f}%',c='white',va='center', fontweight='bold')
plt.title("Class")
plt.show()
sampled_y = sampled_df.pop('target')
So we have another, roughly balanced dataset; let's split it into 60% training, 20% validation, and 20% test sets.
x_trains, x_tests, y_trains, y_tests = train_test_split(sampled_df[sampled_df.columns[:5]], sampled_y, test_size = 0.40, random_state = 42)
print("x_train: {}".format(x_trains.shape))
print("y_train: {}".format(y_trains.shape))
x_trainsv,x_testsv, y_trainsv, y_testsv = train_test_split(x_tests, y_tests, test_size = 0.50, random_state = 42)
print("x_val : {}".format(x_trainsv.shape))
print("y_val: {}".format(y_trainsv.shape))
print("x_test: {}".format(x_testsv.shape))
print("y_test: {}".format(y_testsv.shape))
%%time
# from last time, our logistic regression algorithm is given by (including everything we previously had):
class BinaryLogisticRegression:
def __init__(self, eta, iterations=20, C=0.001,do_C=False,do_alpha =False,alpha=0.001,do_C_alpha=False,sample_weight=False):
self.eta = eta
self.iters = iterations
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.sample_weight = sample_weight
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if the object has been trained
else:
return 'Untrained Binary Logistic Regression Object'
# convenience, private:
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
@staticmethod
def _sigmoid(theta):
# increase stability, redefine sigmoid operation
return expit(theta) #1/(1+np.exp(-theta))
@staticmethod
def get_sample_weight(y):
sw = [len(y)/(len(np.unique(y))*i) for i in np.bincount(y) ]
return np.array([sw[j] for j in y])
# vectorized gradient calculation with regularization using L2 Norm and L1 Norm
def _get_gradient(self,X,y):
ydiff = y-self.predict_proba(X,add_bias=False).ravel() # get y difference
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
if self.do_C:
gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_alpha:
# from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
gradient[1:] += - self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] += - self.alpha * np.sign(self.w_[1:]) - 2 * self.w_[1:] * self.C
return gradient
# public:
def predict_proba(self,X,add_bias=True):
# add bias term if requested
Xb = self._add_bias(X) if add_bias else X
return self._sigmoid(Xb @ self.w_) # return the probability y=1
def predict(self,X):
return (self.predict_proba(X)>0.5) #return the actual prediction
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
# print(self.get_sample_weight(y))
# for as many as the max iterations
for _ in range(self.iters):
gradient = self._get_gradient(Xb,y)
self.w_ += gradient*self.eta # multiply by learning rate
# self.w_ *= self.get_sample_weight(y) # multiply weight with sample weight
# add because we are maximizing the log-likelihood
blr = BinaryLogisticRegression(eta=0.1,iterations=50,C=0.01,alpha=0.01,do_C_alpha=True)
blr.fit(x_trains,y_trains)
print(blr)
yhat = blr.predict(x_trainsv)
print('Accuracy of: ',accuracy_score(y_trainsv,yhat))
Line search algorithm:
initialize the weight vector
for each line:
    compute the gradient
    save_log = {}
    for eta in linspace(0, 1, iterations):
        compute the log-likelihood using the gradient and eta, and save it in a dictionary: save_log[eta] = log likelihood
    find the eta that gives the minimum log-likelihood over the range
    update the weights by eta * gradient
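The pseudocode above can be sketched as a minimal grid line search on a simple 1-D quadratic loss, a hypothetical stand-in for the negative log-likelihood (all names here are illustrative):

```python
import numpy as np

def loss(w):            # stand-in objective with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0
for _ in range(5):                           # "for line in lines"
    g = grad(w)
    etas = np.linspace(0, 1, 50)             # candidate step sizes
    losses = [loss(w - eta * g) for eta in etas]
    best_eta = etas[int(np.argmin(losses))]  # eta giving the minimum loss
    w -= best_eta * g                        # step along the gradient by best_eta
```

Each outer iteration fixes a search direction (the gradient) and picks the best step size along it, exactly as the class below does with `np.linspace(0, 1, self.iters)`.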
%%time
# and we can update this to use a line search along the gradient like this:
from scipy.optimize import minimize_scalar
import copy
from numpy import ma # (masked array) this has most numpy functions that work with NaN data.
class LineSearchLogisticRegression(BinaryLogisticRegression):
# define custom line search for problem
def __init__(self, line_iters=0.0,do_plot=False, **kwds):
self.line_iters = line_iters
self.do_plot = do_plot
# but keep other keywords
super().__init__(**kwds) # call parent initializer
# this defines the function with the first input to be optimized
# therefore eta will be optimized, with all inputs constant
# https://stackoverflow.com/questions/21610198/runtimewarning-divide-by-zero-encountered-in-log by Chiraz BenAbdelkader Jul 1, 2020
@staticmethod
def safe_log(x, eps=1e-10):
result = np.where(x > eps, x, eps)
np.log(result, out=result, where=result > 0)
return result
# @staticmethod
def objective_function(self,eta,X,y,w,grad):
wnew = w - eta*(grad)
gi = expit(X @ wnew).ravel()
sw = self.get_sample_weight(y) # calculate sample weight of the class
loglike = - y*self.safe_log(gi) - (np.ones(y.shape)-y)*self.safe_log(1-gi)
# print(loglike)
if self.do_C:
loglike[1:] += self.C*np.sum(wnew[1:]**2 )
if self.do_alpha:
loglike[1:] += self.alpha *np.sum(np.absolute(wnew[1:]))
if self.do_C_alpha:
loglike[1:] += self.alpha *np.sum(np.absolute(wnew[1:]))+ self.C*np.sum(wnew[1:]**2)
if self.sample_weight:return np.sum(loglike*sw)
else: return np.sum(loglike)
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
for l in range(int(self.line_iters)):
# print('line ',l)
# temp_weight = np.zeros((num_features,1))
gradient = -self._get_gradient(Xb,y)
objective_dict={}
for eta_i in np.linspace(0,1,self.iters):
# print('eta i ', eta_i)
# print('objective grad',self.objective_gradient(self.w_,Xb,y,self.C))
# print('objective function ',self.objective_function(eta_i,Xb,y,self.w_,gradient,self.C))
# print('grad ', gradient)
objective_dict[eta_i] = self.objective_function(eta_i,Xb,y,self.w_,gradient)
# assert False, print(objective_dict)
min_eta = min(objective_dict, key=objective_dict.get)
# print('min_eta',min_eta)
min_logloss = min(objective_dict.values())
# print('min_logloss',min_logloss)
self.w_ -= gradient*min_eta
if self.do_plot:
plt.figure()
plt.plot(objective_dict.keys(),objective_dict.values())
plt.title(f'line {l} '+ f'min loss likelihood {min_logloss:.2f}, eta : {min_eta:.5f}')
# plt.text(0.5, 10,f'min loss likelihood {min_logloss:.2f}, eta : {min_eta:.5f}',horizontalalignment='center',verticalalignment='center')
plt.xlabel('eta')
plt.ylabel('log likelihood')
plt.show()
Testing the line search here, we see that with each line we minimize the log-likelihood.
%%time
lslr = LineSearchLogisticRegression(eta=0.01, # the initial eta is not used; the search tries different eta values along the gradient direction
iterations=50, # important: this is how many uniformly distributed eta values are tested, eta_range = np.linspace(0,1,iterations)
line_iters=10,
C=20,
alpha=10,
# do_plot=True,
do_C=True,
do_plot=True,
sample_weight=False
)
lslr.fit(x_trains,y_trains)
yhat = lslr.predict(x_trainsv)
print(lslr)
print('Accuracy of: ',accuracy_score(y_trainsv,yhat))
class LineSearchLogisticRegressionMulit:
def __init__(self, eta, iterations=20,line_iters=4,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False,sample_weight=False):
self.eta = eta
self.iters = iterations
self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
self.sample_weight=sample_weight
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if the object has been trained
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = LineSearchLogisticRegression(eta=self.eta,
iterations = self.iters,
line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha,
do_plot=self.do_plot,
sample_weight = self.sample_weight
)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
# lr = LineSearchLogisticRegressionMulit(0.1,10)
# print(lr)
%%time
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=20, # important: this is how many uniformly distributed eta values are tested, eta_range = np.linspace(0,1,iterations)
line_iters=10,
C=20,
alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_trains,y_trains)
yhat = lslr.predict(x_trainsv)
print(lslr)
print('Accuracy of: ',accuracy_score(y_trainsv,yhat))
param = {
"iterations": 20,
"line_iters": 2,
"alpha": 0.01,
"eta" : 0.1,
"C": 10,
"do_C_alpha":True ,
# 'sample_weight': True,
}
clf = LineSearchLogisticRegressionMulit(**param)
clf.fit(x_trains,y_trains)
yhat = clf.predict(x_trainsv)
acc = accuracy_score(y_trainsv,yhat)
Using Optuna to optimize the hyperparameters C, alpha, and the number of line-search iterations. This run uses the sampled dataset, i.e. 20k events sampled from the galaxy class.
%%time
study_name = "LineSearchLogisticRegression" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": 20,
"line_iters": trial.suggest_int("line_iters", 2, 100, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 40., log=True),
"eta" : 0.1,
"C": trial.suggest_float("C", 1e-8, 40, log=True),
# "do_C_alpha": ,
# 'sample_weight': True,
}
clf = LineSearchLogisticRegressionMulit(**param)
clf.fit(x_trains,y_trains)
yhat = clf.predict(x_trainsv)
acc = accuracy_score(y_trainsv,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=40)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
The performance is not great; we reach about 65% accuracy.
study_name = "LineSearchLogisticRegression" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
History of the trials
plot_optimization_history(study).show()
Contour plots of hyperparameter pairs vs. accuracy
plot_contour(study, params=['C','alpha']).show()
plot_contour(study, params=['C','line_iters']).show()
plot_contour(study, params=['alpha','line_iters']).show()
plot_slice(study)
plot_parallel_coordinate(study)
# linear boundaries visualization from sklearn documentation
from matplotlib import pyplot as plt
import copy
%matplotlib inline
plt.style.use('ggplot')
def plot_decision_boundaries(lr,Xin,y,title=''):
Xb = copy.deepcopy(Xin)
lr.fit(Xb[:,:2],y) # train only on two features
h=0.01
# create a mesh to plot in
x_min, x_max = Xb[:, 0].min() - 1, Xb[:, 0].max() + 1
y_min, y_max = Xb[:, 1].min() - 1, Xb[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# get prediction values
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.5)
# Plot also the training points
plt.scatter(Xb[:, 0], Xb[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('PCA 0')
plt.ylabel('PCA 1')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(title)
plt.show()
# lr = LogisticRegression(0.1,1500) # this is still OUR LR implementation, not sklearn
lslr = LineSearchLogisticRegressionMulit(eta=0.1, # the initial eta is not used; the search tries different eta values along the gradient direction
iterations=20, # important: this is how many uniformly distributed eta values are tested, eta_range = np.linspace(0,1,iterations)
line_iters=100,
C=26.48693164436823,
alpha=4.940325121692573e-06,
# do_plot=True,
do_C_alpha=True,
# do_plot=True,
# sample_weight=True
)
plot_decision_boundaries(lslr,x_trainsv.to_numpy(),y_trainsv)
study_name="LineSearchLogisticRegression"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
best_params["do_C_alpha"] = True
clf = LineSearchLogisticRegressionMulit(**best_params,eta=0.1,iterations=20)
clf.fit(x_trains.to_numpy(),y_trains)
yhat = clf.predict_proba(x_testsv)
ytrainhat = clf.predict_proba(x_trains)
yvalhat = clf.predict_proba(x_trainsv)
plt.figure(figsize=(15,15))
plt.subplot(3,2,1)
plot_sigbkg(0,"GALAXY",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
plt.subplot(3,2,2)
plot_roc(0,"GALAXY",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
# plt.show()
plt.subplot(323)
plot_sigbkg(1,"QSO",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
plt.subplot(324)
plot_roc(1,"QSO",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
# plt.show()
plt.subplot(325)
plot_sigbkg(2,"STAR",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
plt.subplot(326)
plot_roc(2,"STAR",y_trains,ytrainhat,y_trainsv,yvalhat,y_testsv,yhat)
plt.show()
Using Optuna to optimize the hyperparameters C, alpha, and the number of line-search iterations. This run uses the full dataset without sampling but with sample weights; performance is much better, but the galaxy class is not classified well.
%%time
study_name = "LineSearchLogisticRegressionsw" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": 20,
"line_iters": trial.suggest_int("line_iters", 2, 100, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 40., log=True),
"eta" : 0.1,
"C": trial.suggest_float("C", 1e-8, 40, log=True),
# "do_C_alpha": ,
'sample_weight': True,
}
clf = LineSearchLogisticRegressionMulit(**param)
clf.fit(x_train,y_train)
yhat = clf.predict(x_train1)
acc = accuracy_score(y_train1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=50)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
study_name = "LineSearchLogisticRegressionsw" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
print(best_params)
plot_optimization_history(study).show()
plot_contour(study, params=['C','alpha']).show()
plot_contour(study, params=['C','line_iters']).show()
plot_contour(study, params=['alpha','line_iters']).show()
plot_slice(study)
plot_parallel_coordinate(study)
study_name="LineSearchLogisticRegressionsw"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
best_params["do_C_alpha"] = True
clf = LineSearchLogisticRegressionMulit(C= 0.21094460395008038, alpha= 3.781323294700339, line_iters= 98,eta=0.1,iterations=20,sample_weight=True)
clf.fit(x_train.to_numpy(),y_train)
yhat = clf.predict_proba(x_test1)
ytrainhat = clf.predict_proba(x_train)
yvalhat = clf.predict_proba(x_train1)
plt.figure(figsize=(15,15))
plt.subplot(3,2,1)
plot_sigbkg(0,"GALAXY",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
plt.subplot(3,2,2)
plot_roc(0,"GALAXY",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
# plt.show()
plt.subplot(323)
plot_sigbkg(1,"QSO",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
plt.subplot(324)
plot_roc(1,"QSO",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
# plt.show()
plt.subplot(325)
plot_sigbkg(2,"STAR",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
plt.subplot(326)
plot_roc(2,"STAR",y_train,ytrainhat,y_train1,yvalhat,y_test1,yhat)
plt.show()
Here I tried to optimize the hyperparameters by hand.
acc_ = []
for l in range(1,20):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=l,
C=0.08,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(acc_)
plt.xlabel('number of lines')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=3,
C=l*0.1,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.1,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=3,
C=l,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10),acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(10,20):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=3,
C=l,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(10,20),acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=3,
C=l*0.01,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.01,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=5,
# C=l*0.01,
alpha=l*0.01,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.01,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=5,
# C=l*0.01,
alpha=l*0.001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc__ = {}
for l in range(1,10):
for c in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=5,
C=c*0.01,
alpha=l*0.001,
# do_plot=True,
do_C_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc__[l,c] = accuracy_score(y_test1,yhat)
# plt.figure()
# plt.plot(np.arange(1,10)*0.001,acc_)
# plt.xlabel('alpha')
# plt.ylabel('accuracy')
# plt.show()
pd.DataFrame(np.reshape(list(acc__.values()),(9,9)), columns=np.arange(1,10)*0.01,index=np.arange(1,10)*0.001)
max(acc__.values())
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: this is how many uniformly distributed eta bins are tested, eta_range = np.linspace(0,1,iterations)
line_iters=5,
C=0.08,
alpha=l*0.001,
# do_plot=True,
do_C_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticRegressionMulit(eta=0.1,
iterations=10, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=5,
# C=l*0.01,
alpha=l*0.01,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.01,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
Now let's implement stochastic gradient descent.
%%time
class StochasticLogisticRegression(BinaryLogisticRegression):
# stochastic gradient calculation
def _get_gradient(self,X,y):
idx = int(np.random.rand()*len(y)) # grab random instance
ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_C:
gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_alpha:
# from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
gradient[1:] += - self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] += - self.alpha * np.sign(self.w_[1:]) - 2 * self.w_[1:] * self.C
return gradient
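As a sanity check that a single-random-sample update of this form learns at all, here is a self-contained sketch on synthetic separable data (plain NumPy, independent of the classes in this notebook; the data, seed, and step count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable problem: label is 1 when the feature sum is positive.
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
Xb = np.hstack([np.ones((500, 1)), X])  # prepend a bias column

w = np.zeros(3)
eta = 0.5
for _ in range(2000):
    i = rng.integers(len(y))                 # grab one random instance
    p = 1.0 / (1.0 + np.exp(-Xb[i] @ w))     # sigmoid of the linear score
    w += eta * (y[i] - p) * Xb[i]            # single-sample gradient ascent on the log-likelihood

acc = np.mean(((Xb @ w) > 0) == y)
print(acc)
```

Each step uses one instance, so many more iterations are needed than for batch gradient ascent, which is why the class above is run with `iterations=1200` and more.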
slr = StochasticLogisticRegression(eta=0.9, iterations=1200, C=0.001,alpha=0.001,do_C_alpha=True) # take a lot more steps!!
slr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = slr.predict(x_test1)
print(slr)
print('Accuracy of: ',accuracy_score(y_test1,yhat))
class StochasticLogisticRegressionMulit:
def __init__(self, eta, iterations=20,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False):
self.eta = eta
self.iters = iterations
# self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = StochasticLogisticRegression(eta=self.eta,
iterations = self.iters,
# line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha,
# do_plot=self.do_plot
)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
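The one-vs-rest prediction rule shared by all the multiclass wrappers here (one probability column per binary classifier, argmax along each row, then index back into the unique class labels) can be illustrated in isolation; the probability values below are hypothetical:

```python
import numpy as np

classes = np.array(['GALAXY', 'QSO', 'STAR'])  # unique class values, as np.unique(y) would return
# Hypothetical per-class probability columns for 4 observations (one column per binary classifier).
probs = np.array([[0.9, 0.05, 0.05],
                  [0.2, 0.70, 0.10],
                  [0.3, 0.30, 0.40],
                  [0.1, 0.20, 0.70]])
pred = classes[np.argmax(probs, axis=1)]       # take argmax along each row
print(pred)
```

This is exactly what `predict` does after `predict_proba` stacks the per-classifier outputs with `np.hstack`.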
Optimizing stochastic gradient descent by hand.
%%time
acc_ = []
for l in np.linspace(0.001,1,50):
lslr = StochasticLogisticRegressionMulit(eta=l,
iterations=5000, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.linspace(0.001,1,50),acc_)
plt.xlabel('eta')
plt.ylabel('accuracy')
plt.title('eta w/ 5000 iterations ')
plt.show()
acc_ = []
for l in np.arange(100,5000,200):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=l, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(100,5000,200),acc_)
plt.xlabel('iterations')
plt.ylabel('accuracy')
plt.title('iterations w/ eta 0.1')
plt.show()
acc_ = []
for l in np.arange(100,5000,200):
lslr = StochasticLogisticRegressionMulit(eta=0.01,
iterations=l, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(100,5000,200),acc_)
plt.xlabel('iterations')
plt.ylabel('accuracy')
plt.title('iterations w/ eta 0.01')
plt.show()
acc_ = []
for l in np.arange(100,5000,200):
lslr = StochasticLogisticRegressionMulit(eta=0.5,
iterations=l, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(100,5000,200),acc_)
plt.xlabel('iterations')
plt.ylabel('accuracy')
plt.title('iterations w/ eta 0.5')
plt.show()
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
alpha=l*0.001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
alpha=l*0.0001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.0001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
# C=l*0.01,
alpha=l*0.00001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.00001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
C=l*0.01,
# alpha=l*0.0001,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.01,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
C=l*0.001,
# alpha=l*0.0001,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.001,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,100):
lslr = StochasticLogisticRegressionMulit(eta=0.1,
iterations=1000, # number of single-sample gradient steps
# line_iters=5,
C=l*0.0001,
# alpha=l*0.0001,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,100)*0.0001,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
I will optimize stochastic gradient descent with Optuna later; for now, let's implement gradient descent using the Hessian (Newton's method).
%%time
from numpy.linalg import pinv
class HessianBinaryLogisticRegression(BinaryLogisticRegression):
# just overwrite gradient function
def _get_gradient(self,X,y):
g = self.predict_proba(X,add_bias=False).ravel() # get sigmoid value for all samples
hessian = X.T @ np.diag(g*(1-g)) @ X + 2 * self.C * np.eye(X.shape[1]) # negated Hessian of the L2-penalized log-likelihood (ridge term on the diagonal)
ydiff = y-g # get y difference
gradient = np.sum(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_C:
gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_alpha:
# from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
gradient[1:] += - self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] += - self.alpha * np.sign(self.w_[1:]) - 2 * self.w_[1:] * self.C
return pinv(hessian) @ gradient
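Why so few iterations are needed: multiplying the gradient by `pinv(hessian)` rescales each direction by its local curvature, and on a quadratic objective a single Newton step lands exactly on the optimum. A self-contained toy illustration (a made-up quadratic, not the logistic objective above):

```python
import numpy as np
from numpy.linalg import pinv

# Minimize f(w) = (w - w_star)^T A (w - w_star) for a fixed positive-definite A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
w_star = np.array([1.0, -2.0])
w = np.zeros(2)

grad = 2 * A @ (w - w_star)   # gradient of the quadratic at w
hess = 2 * A                  # constant Hessian
w = w - pinv(hess) @ grad     # one Newton step lands exactly at w_star

print(w)
```

The logistic log-likelihood is not exactly quadratic, so a handful of such steps (here `iterations=10`) is used instead of one, but far fewer than plain gradient descent needs.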
hlr = HessianBinaryLogisticRegression(eta=0.01,
iterations=10,
C=15,
alpha=10,
do_C_alpha=True) # note that we need only a few iterations here
# hlr.fit(X,y)
# yhat = hlr.predict(X)
# print(hlr)
# print('Accuracy of: ',accuracy_score(y,yhat))
class HessianBinaryLogisticRegressionMulit:
def __init__(self, eta, iterations=20,line_iters=4,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False):
self.eta = eta
self.iters = iterations
self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = HessianBinaryLogisticRegression(eta=self.eta,
iterations = self.iters,
# line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha,
# do_plot=self.do_plot
)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
I wanted to see whether the Hessian-based gradient descent (Newton's method) is working.
%%time
acc_ = []
for l in np.linspace(0.01,1,20):
lslr = HessianBinaryLogisticRegressionMulit(eta=l,
iterations=5,
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.linspace(0.01,1,20),acc_)
plt.xlabel('eta')
plt.ylabel('accuracy')
plt.title('eta w/ 5 iterations ')
plt.show()
Let's code up the MSE version of the above algorithms.
%%time
# from last time, our logistic regression algorithm is given by (including everything we previously had):
class BinaryLogisticMSERegression:
def __init__(self, eta, iterations=20, C=0.001,do_C=False,do_alpha =False,alpha=0.001,do_C_alpha=False,sample_weight=False):
self.eta = eta
self.iters = iterations
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.sample_weight = sample_weight
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression MSE Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained Binary Logistic Regression Object'
# convenience, private:
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
@staticmethod
def _sigmoid(theta):
# increase stability, redefine sigmoid operation
return expit(theta) #1/(1+np.exp(-theta))
@staticmethod
def get_sample_weight(y):
sw = [len(y)/(len(np.unique(y))*i) for i in np.bincount(y) ]
return np.array([sw[j] for j in y])
# vectorized gradient calculation with regularization using L2 Norm and L1 Norm
def _get_gradient(self,X,y):
ydiff = y - self.predict_proba(X,add_bias=False).ravel() # get y difference; negative of the MSE gradient, so w_ += eta*gradient descends the MSE (matches the stochastic and Hessian subclasses)
gradient = 2*np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
if self.do_C:
gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_alpha:
# from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
gradient[1:] += - self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] += - self.alpha * np.sign(self.w_[1:]) - 2 * self.w_[1:] * self.C
return gradient
# public:
def predict_proba(self,X,add_bias=True):
# add bias term if requested
Xb = self._add_bias(X) if add_bias else X
return Xb @ self.w_ # return the probability y=1
def predict(self,X):
return (self.predict_proba(X)>0.5) #return the actual prediction
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
# print(self.get_sample_weight(y))
# for as many as the max iterations
for _ in range(self.iters):
gradient = self._get_gradient(Xb,y)
self.w_ += gradient*self.eta # multiply by learning rate
# self.w_ *= self.get_sample_weight(y) # multiply weight with sample weight
# add because we are maximizing (the gradient is of the negative loss)
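The analytic MSE gradient for a linear model can be verified against central finite differences; a self-contained sketch on random data (all names here are hypothetical, and the model is just `Xb @ w` as in the class):

```python
import numpy as np

rng = np.random.default_rng(1)
Xb = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])  # bias column + 2 features
y = rng.normal(size=50)
w = rng.normal(size=3)

def mse(w):
    return np.mean((Xb @ w - y) ** 2)

# Analytic gradient of the MSE: (2/N) * Xb^T (Xb w - y)
grad = 2 * Xb.T @ (Xb @ w - y) / len(y)

# Central finite differences, one coordinate direction at a time
eps = 1e-6
fd = np.array([(mse(w + eps * e) - mse(w - eps * e)) / (2 * eps)
               for e in np.eye(3)])

print(np.max(np.abs(grad - fd)))
```

This confirms the sign convention: the analytic gradient points uphill on the MSE, so a descent update must subtract it (equivalently, add its negative, which is what the class stores).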
blr = BinaryLogisticMSERegression(eta=0.1,iterations=50,C=0.01,alpha=0.01,do_C_alpha=True)
blr.fit(x_train,y_train)
print(blr)
yhat = blr.predict(x_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
%%time
# and we can update this to use a line search along the gradient like this:
from scipy.optimize import minimize_scalar
import copy
from numpy import ma # (masked array) this has most numpy functions that work with NaN data.
class LineSearchLogisticMSERegression(BinaryLogisticMSERegression):
# define custom line search for problem
def __init__(self, line_iters=0.0,do_plot=False, **kwds):
self.line_iters = line_iters
self.do_plot = do_plot
# but keep other keywords
super().__init__(**kwds) # call parent initializer
# this defines the function with the first input to be optimized
# therefore eta will be optimized, with all inputs constant
# https://stackoverflow.com/questions/21610198/runtimewarning-divide-by-zero-encountered-in-log by Chiraz BenAbdelkader Jul 1, 2020
@staticmethod
def safe_log(x, eps=1e-10):
result = np.where(x > eps, x, eps)
np.log(result, out=result, where=result > 0)
return result
def objective_function(self,eta,X,y,w,grad,C):
wnew = w - eta*(grad)
gi = X @ wnew
sw = self.get_sample_weight(y) # calculate sample weight of the class
ydiff = y-gi.ravel()
mse = np.square(ydiff)
# loglike = []
# for i,_y in enumerate(y):
# if _y == 1:
# loglike.append(-np.log(gi[i][0]) + self.C*sum(wnew**2))
# else:
# loglike.append(-np.log(1-gi[i][0]) + self.C*sum(wnew**2))
# # the line search is looking for minimization, so take the negative of l(w)
if self.sample_weight == True: return np.sum(mse*sw)
else: return np.sum(mse)
# multiply sample weight and cost function
# return -np.sum(y*self.safe_log(g)+(1-y)*self.safe_log(1-g))+ C*sum(wnew**2)
def fit(self, X, y):
Xb = self._add_bias(X) # add bias term
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
for l in range(int(self.line_iters)):
# print('line ',l)
# temp_weight = np.zeros((num_features,1))
gradient = -self._get_gradient(Xb,y)
objective_dict={}
for eta_i in np.linspace(0.00001,1,self.iters):
# print('eta i ', eta_i)
# print('objective grad',self.objective_gradient(self.w_,Xb,y,self.C))
# print('objective function ',self.objective_function(eta_i,Xb,y,self.w_,gradient,self.C))
# print('grad ', gradient)
objective_dict[eta_i] = self.objective_function(eta_i,Xb,y,self.w_,gradient,self.C)
# assert False, print(objective_dict)
min_eta = min(objective_dict, key=objective_dict.get)
# print('min_eta',min_eta)
min_logloss = min(objective_dict.values())
# print('min_logloss',min_logloss)
self.w_ -= gradient*min_eta
if self.do_plot:
plt.figure()
plt.plot(list(objective_dict.keys()),list(objective_dict.values())) # matplotlib needs sequences, not dict views
plt.title(f'line {l} '+ f'min MSE {min_logloss:.2f}, eta : {min_eta:.5f}')
# plt.text(0.5, 10,f'min MSE {min_logloss:.2f}, eta : {min_eta:.5f}',horizontalalignment='center',verticalalignment='center')
plt.xlabel('eta')
plt.ylabel('MSE (line-search objective)')
plt.show()
class LineSearchLogisticMSERegressionMulit:
def __init__(self, eta, iterations=20,line_iters=4,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False,sample_weight=False):
self.eta = eta
self.iters = iterations
self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
self.sample_weight = sample_weight
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = LineSearchLogisticMSERegression(eta=self.eta,
iterations = self.iters,
line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha,
do_plot=self.do_plot )
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
# lr = LineSearchLogisticRegressionMulit(0.1,10)
# print(lr)
Hyperparameter optimization for the MSE line search.
%%time
study_name = "LineSearchLogisticMSERegressionsw" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": 20,
"line_iters": trial.suggest_int("line_iters", 2, 100, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 40., log=True),
"eta" : 0.1,
"C": trial.suggest_float("C", 1e-8, 40, log=True),
"do_C_alpha":True ,
'sample_weight': True,
}
clf = LineSearchLogisticMSERegressionMulit(**param)
clf.fit(x_train,y_train)
yhat = clf.predict(x_train1)
acc = accuracy_score(y_train1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name,load_if_exists=True) # load_if_exists lets reruns resume from the sqlite storage instead of erroring
study.optimize(objective, n_trials=50)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
Not sure why this is not performing well. I will not plot the test-vs-train curves, but the accuracy is quite poor.
Optimizing the MSE line search by hand, however, does seem to perform much better than the log-likelihood version.
%%time
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
# C=l*0.01,
alpha=l*0.001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=10, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
# C=l*0.01,
alpha=l*0.0001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.0001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
%%time
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=10, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
# C=l*0.01,
alpha=l*0.00001,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.00001,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
# C=l*0.01,
alpha=l*0.1,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.1,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
# C=l*0.01,
alpha=l*1,
# do_plot=True,
do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*1,acc_)
plt.xlabel('alpha')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
C=l*0.01,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.01,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
C=l*0.001,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.001,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=20, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
C=l*0.0001,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.0001,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=10, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
C=l*0.1,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*0.1,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
acc_ = []
for l in range(1,10):
lslr = LineSearchLogisticMSERegressionMulit(eta=0.1,
iterations=10, # important: number of uniformly distributed eta bins tested, i.e. eta_range = np.linspace(0,1,iterations)
line_iters=50,
C=l*1,
# alpha=4,
# do_plot=True,
do_C=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1,y_train1)
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.arange(1,10)*1,acc_)
plt.xlabel('C')
plt.ylabel('accuracy')
plt.show()
MSE version of stochastic gradient descent.
%%time
class StochasticLogisticMSERegression(BinaryLogisticMSERegression):
# stochastic gradient calculation
def _get_gradient(self,X,y):
idx = int(np.random.rand()*len(y)) # grab random instance
ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False).ravel() # get y difference (now scalar)
gradient = 2*X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_C:
gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_alpha:
# from https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models
gradient[1:] += - self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] += - self.alpha * np.sign(self.w_[1:]) - 2 * self.w_[1:] * self.C
return gradient
slr = StochasticLogisticMSERegression(eta=0.01, iterations=1200, C=0.001,alpha=0.001,do_C_alpha=True) # take a lot more steps!!
slr.fit(X,y)
yhat = slr.predict(X)
# print(slr)
print('Accuracy of: ',accuracy_score(y,yhat))
class StochasticLogisticMSERegressionMulit:
def __init__(self, eta, iterations=20,line_iters=4,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False):
self.eta = eta
self.iters = iterations
self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = StochasticLogisticMSERegression(eta=self.eta,
iterations = self.iters,
# line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha)
# do_plot=self.do_plot )
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
# lr = LineSearchLogisticRegressionMulit(0.1,10)
# print(lr)
%%time
acc_ = []
for l in np.linspace(0.001,0.2,40):
lslr = StochasticLogisticMSERegressionMulit(eta=l,
iterations=5000,
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.linspace(0.001,0.2,40),acc_) # x values must match the eta grid used in the loop above
plt.xlabel('eta')
plt.ylabel('accuracy')
plt.title('eta w/ 5000 iterations ')
plt.show()
%%time
acc_ = []
for l in np.linspace(0.001,0.1,40):
lslr = StochasticLogisticMSERegressionMulit(eta=l,
iterations=5000,
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = lslr.predict(x_test1)
acc_.append(accuracy_score(y_test1,yhat))
plt.figure()
plt.plot(np.linspace(0.001,0.1,40),acc_)
plt.xlabel('eta')
plt.ylabel('accuracy')
plt.title('eta w/ 5000 iterations ')
plt.show()
%%time
from numpy.linalg import pinv
class HessianBinaryLogisticMSERegression(BinaryLogisticMSERegression):
# just overwrite gradient function
def _get_gradient(self,X,y):
g = self.predict_proba(X,add_bias=False).ravel() # get the linear prediction for all samples (no sigmoid in the MSE model)
if not self.do_C and not self.do_alpha and not self.do_C_alpha:
hessian = X.T @ X # Hessian of the (unscaled) MSE
if self.do_C:
hessian = X.T @ X + 2 * self.C * np.eye(X.shape[1]) # ridge penalty adds 2C to the diagonal, keeping the Hessian positive definite
if self.do_alpha:
hessian = X.T @ X # the L1 penalty has zero curvature away from w=0, so it adds nothing here
if self.do_C_alpha:
hessian = X.T @ X + 2 * self.C * np.eye(X.shape[1]) # only the ridge term contributes curvature
ydiff = y-g # get y difference
gradient = 2 * np.sum(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
gradient = gradient.reshape(self.w_.shape)
# gradient[1:] += -2 * self.w_[1:] * self.C
if self.do_C:
gradient[1:] -= 2 * self.w_[1:] * self.C
if self.do_alpha:
gradient[1:] -= self.alpha * np.sign(self.w_[1:])
if self.do_C_alpha:
gradient[1:] -= self.alpha * np.sign(self.w_[1:]) + 2 * self.w_[1:] * self.C
return pinv(hessian) @ gradient
hlr = HessianBinaryLogisticMSERegression(eta=0.1,
iterations=20,
# C=15,
# alpha=10,
# do_C_alpha=True
) # note that we need only a few iterations here
hlr.fit(X,y)
yhat = hlr.predict(X)
print(hlr)
print('Accuracy of: ',accuracy_score(y,yhat))
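The class above replaces the gradient with a Newton step, pinv(Hessian) @ gradient, which is why only ~20 iterations are needed. For comparison, here is a minimal sketch (on hypothetical toy data, not the SDSS features) of the textbook Newton update for the log-likelihood version, where the exact Hessian is X^T S X with S = diag(p(1-p)):

```python
import numpy as np
from numpy.linalg import pinv

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iterations=10):
    """Newton's method for binary logistic regression (log-likelihood)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (y - p)            # gradient of the log-likelihood
        S = p * (1 - p)                  # diagonal Hessian weights
        H = Xb.T @ (S[:, None] * Xb)     # H = X^T S X
        w = w + pinv(H) @ grad           # Newton ascent step
    return w, Xb

# hypothetical 1-D toy data: class flips between x=1 and x=2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w, Xb = newton_logistic(X, y)
yhat = (sigmoid(Xb @ w) > 0.5).astype(int)
```

As in the class above, a handful of Newton iterations is enough here, versus thousands of steps for plain gradient descent.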
class HessianBinaryLogisticMSERegressionMulit:
def __init__(self, eta, iterations=20,line_iters=4,C=0.001,alpha=0.001,do_C=False,do_alpha =False,do_C_alpha=False,do_plot=False):
self.eta = eta
self.iters = iterations
self.line_iters = line_iters
self.C = C
self.do_C = do_C
self.alpha = alpha
self.do_alpha = do_alpha
self.do_C_alpha = do_C_alpha
self.do_plot = do_plot
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.unique(y) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = [] # will fill this array with binary classifiers
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = (y==yval) # create a binary problem
# train the binary classifier for this class
blr = HessianBinaryLogisticMSERegression(eta=self.eta,
iterations = self.iters,
# line_iters=self.line_iters,
C=self.C,
alpha=self.alpha,
do_C=self.do_C,
do_alpha=self.do_alpha,
do_C_alpha=self.do_C_alpha)
# do_plot=self.do_plot )
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
probs.append(blr.predict_proba(X)) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)] # take argmax along row
# lr = LineSearchLogisticRegressionMulit(0.1,10)
# print(lr)
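All of the *Mulit wrappers follow the same one-vs-rest recipe: fit one binary classifier per unique class, stack their probabilities into a matrix, and take the argmax along each row. A minimal self-contained sketch of that recipe, using plain gradient ascent and hypothetical toy clusters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(Xb, y01, eta=0.1, iterations=2000):
    """Plain gradient ascent on the binary logistic log-likelihood."""
    w = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        p = sigmoid(Xb @ w)
        w += eta * Xb.T @ (y01 - p) / len(y01)
    return w

def fit_ovr(X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add bias column
    classes = np.unique(y)                     # one binary problem per class
    W = np.vstack([fit_binary(Xb, (y == c).astype(float)) for c in classes])
    return classes, W

def predict_ovr(X, classes, W):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return classes[np.argmax(sigmoid(Xb @ W.T), axis=1)]  # argmax along row

# hypothetical toy data: three well-separated 2-D clusters
X = np.array([[0,0],[0,1],[1,0], [5,0],[5,1],[6,0], [2,5],[3,5],[2,6]], dtype=float)
y = np.array([0,0,0, 1,1,1, 2,2,2])
classes, W = fit_ovr(X, y)
yhat = predict_ovr(X, classes, W)
```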
%%time
acc_ = []
for l in np.linspace(0.001,1,50):
lslr = StochasticLogisticMSERegressionMulit(eta=l,
iterations=5,
# line_iters=5,
# C=l*0.01,
# alpha=l*0.001,
# do_plot=True,
# do_alpha=True,
# do_plot=True,
# sample_weight=True
)
lslr.fit(x_train.to_numpy(),y_train.to_numpy())
yhat = lslr.predict(x_test)
acc_.append(accuracy_score(y_test,yhat))
plt.figure()
plt.plot(np.linspace(0.001,1,50),acc_)
plt.xlabel('eta')
plt.ylabel('accuracy')
plt.title('eta w/ 5 iterations')
plt.show()
Now let's look at stochastic gradient descent with the log-likelihood loss.
%%time
study_name = "StochasticLogisticRegressionMulit" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": trial.suggest_int("iterations", 50, 5000, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 10., log=True),
"eta" : trial.suggest_float("eta", 1e-8, 1.0, log=True),
"C": trial.suggest_float("C", 1e-8, 10, log=True),
"do_C_alpha": True
}
clf = StochasticLogisticRegressionMulit(**param)
clf.fit(x_train.to_numpy(),y_train.to_numpy())
yhat = clf.predict(x_train1)
acc = accuracy_score(y_train1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=400)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
study_name="StochasticLogisticRegressionMulit"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
best_params["do_C_alpha"] = True
clf = StochasticLogisticRegressionMulit(**best_params)
clf.fit(x_train.to_numpy(),y_train.to_numpy())
yhat = clf.predict_proba(x_test1)
ytrainhat = clf.predict_proba(x_train)
yvalhat = clf.predict_proba(x_train1)
plt.figure(figsize=(15,15))
plt.subplot(3,2,1)
plot_sigbkg(0,"GALAXY")
plt.subplot(3,2,2)
plot_roc(0,"GALAXY")
# plt.show()
plt.subplot(323)
plot_sigbkg(1,"QSO")
plt.subplot(324)
plot_roc(1,"QSO")
# plt.show()
plt.subplot(325)
plot_sigbkg(2,"STAR")
plt.subplot(326)
plot_roc(2,"STAR")
plt.show()
study_name="StochasticLogisticRegressionMulit"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
plot_optimization_history(study).show()
plot_slice(study)
plot_contour(study, params=['C','alpha']).show()
plot_contour(study, params=['C','iterations']).show()
plot_contour(study, params=['alpha','iterations']).show()
plot_contour(study, params=['eta','iterations']).show()
plot_contour(study, params=['eta','C']).show()
plot_contour(study, params=['eta','alpha']).show()
Optimizing the Hessian algorithm with the log-likelihood loss.
%%time
study_name = "HessianBinaryLogisticRegressionMulit" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": trial.suggest_int("iterations", 2, 300, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 10., log=True),
"eta" : trial.suggest_float("eta", 1e-8, 1.0, log=True),
"C": trial.suggest_float("C", 1e-8, 10, log=True),
"do_C_alpha": True
}
clf = HessianBinaryLogisticRegressionMulit(**param)
clf.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = clf.predict(x_test1)
acc = accuracy_score(y_test1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=40)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
study_name="HessianBinaryLogisticRegressionMulit"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
best_params["do_C_alpha"] = True
clf = HessianBinaryLogisticRegressionMulit(**best_params) # retrain with the best params from the Hessian study
clf.fit(x_train.to_numpy(),y_train.to_numpy())
yhat = clf.predict_proba(x_test1)
ytrainhat = clf.predict_proba(x_train)
yvalhat = clf.predict_proba(x_train1)
plt.figure(figsize=(15,15))
plt.subplot(3,2,1)
plot_sigbkg(0,"GALAXY")
plt.subplot(3,2,2)
plot_roc(0,"GALAXY")
# plt.show()
plt.subplot(323)
plot_sigbkg(1,"QSO")
plt.subplot(324)
plot_roc(1,"QSO")
# plt.show()
plt.subplot(325)
plot_sigbkg(2,"STAR")
plt.subplot(326)
plot_roc(2,"STAR")
plt.show()
study_name="HessianBinaryLogisticRegressionMulit"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
plot_optimization_history(study).show()
plot_slice(study)
plot_contour(study, params=['C','alpha']).show()
plot_contour(study, params=['C','iterations']).show()
plot_contour(study, params=['alpha','iterations']).show()
plot_contour(study, params=['C','eta']).show()
plot_contour(study, params=['alpha','eta']).show()
Now the MSE version of the line-search algorithm.
%%time
study_name = "LineSearchLogisticMSERegression" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": 20,
"line_iters": trial.suggest_int("line_iters", 2, 100, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 10., log=True),
# "eta" : trial.suggest_float("eta", 1e-8, 1.0, log=True),
"C": trial.suggest_float("C", 1e-8, 10, log=True),
"do_C_alpha": True,
# 'sample_weight': True,
}
clf = LineSearchLogisticMSERegressionMulit(**param,eta=0)
clf.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = clf.predict(x_test1.to_numpy())
acc = accuracy_score(y_test1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=40)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
%%time
study_name = "LineSearchLogisticMSERegressiona" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": 20,
"line_iters": trial.suggest_int("line_iters", 2, 100, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 10., log=True),
"eta" : trial.suggest_float("eta", 1e-8, 1.0, log=True),
# "C": trial.suggest_float("C", 1e-8, 10, log=True),
"do_alpha": True,
# 'sample_weight': True,
}
clf = LineSearchLogisticMSERegressionMulit(**param)
clf.fit(x_train1.to_numpy(),y_train1.to_numpy())
yhat = clf.predict(x_test1.to_numpy())
acc = accuracy_score(y_test1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=40)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
%%time
study_name = "StochasticLogisticMSERegressionMulit" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"iterations": trial.suggest_int("iterations", 50, 5000, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 10., log=True),
"eta" : trial.suggest_float("eta", 1e-8, 1.0, log=True),
"C": trial.suggest_float("C", 1e-8, 10, log=True),
"do_C_alpha": True
}
clf = StochasticLogisticMSERegressionMulit(**param)
clf.fit(x_train.to_numpy(),y_train.to_numpy())
yhat = clf.predict(x_train1)
acc = accuracy_score(y_train1,yhat)
return acc
# pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=400)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
from optuna.visualization import plot_contour
from optuna.visualization import plot_edf
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_slice
import plotly.io as pio
pio.renderers.default = 'iframe' # or 'notebook' or 'colab' or 'jupyterlab'
def get_sample_weight(y):
sw = [len(y)/(len(np.unique(y))*i) for i in np.bincount(y) ]
return np.array([sw[j] for j in y])
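get_sample_weight assigns each sample the inverse-frequency weight n/(k * n_c), the same heuristic as sklearn's class_weight='balanced'; a quick check (with a hypothetical label vector) that every class then contributes the same total weight:

```python
import numpy as np

def get_sample_weight(y):
    # weight for class c: len(y) / (num_classes * count_c)
    sw = [len(y) / (len(np.unique(y)) * i) for i in np.bincount(y)]
    return np.array([sw[j] for j in y])

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])  # hypothetical imbalanced labels
w = get_sample_weight(y)
# each class's total weight is len(y) / num_classes = 9 / 3 = 3.0
```

This is what lets the minority classes (QSO, STAR) pull their weight against the majority GALAXY class during xgboost training.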
study_name = "xgboost" # Unique identifier of the study.
CV_RESULT_DIR = os.getcwd()+f"/{study_name}/"
if not os.path.exists(CV_RESULT_DIR): os.mkdir(CV_RESULT_DIR)
storage_name = "sqlite:///{}.db".format(study_name)
def objective(trial):
param = {
"num_class":3,
"verbosity": 0,
"objective": "multi:softprob",
"eval_metric": "auc",
"tree_method": "hist",
"booster": trial.suggest_categorical("booster", ["gbtree", "gblinear", "dart"]),
"lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
"alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
}
if param["booster"] == "gbtree" or param["booster"] == "dart":
param["max_depth"] = trial.suggest_int("max_depth", 1, 9)
param["eta"] = trial.suggest_float("eta", 1e-8, 1.0, log=True)
param["gamma"] = trial.suggest_float("gamma", 1e-8, 1.0, log=True)
param["grow_policy"] = trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"])
if param["booster"] == "dart":
param["sample_type"] = trial.suggest_categorical("sample_type", ["uniform", "weighted"])
param["normalize_type"] = trial.suggest_categorical("normalize_type", ["tree", "forest"])
param["rate_drop"] = trial.suggest_float("rate_drop", 1e-8, 1.0, log=True)
param["skip_drop"] = trial.suggest_float("skip_drop", 1e-8, 1.0, log=True)
results_dict={}
evals_result = {}
dtrain = xgb.DMatrix(x_train, y_train,get_sample_weight(y_train))
dtest = xgb.DMatrix(x_train1, y_train1, get_sample_weight(y_train1))
dtest_unseen = xgb.DMatrix(x_test1, y_test1,get_sample_weight(y_test1) )
pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "test-auc")
watchlist = [(dtest, 'test'), (dtrain, 'train')]
xgbc0 = xgb.train(param, dtrain,evals=watchlist,callbacks=[pruning_callback], evals_result=evals_result)
pd.DataFrame.from_dict({'test-auc':evals_result['test']['auc'],'train-auc':evals_result['train']['auc']}).to_csv(CV_RESULT_DIR+f'{trial.number}.csv')
trial.set_user_attr("n_estimators", xgbc0.best_iteration)
# test_score = xgbc0.predict(dtest_unseen)#y_dftest
# train_score = xgbc0.predict(dtrain) #y_train
# accuracy_test = sklearn.metrics.accuracy_score(y_dftest, np.rint(test_score))
# accuracy_train = sklearn.metrics.accuracy_score(y_train, np.rint(train_score))
# best_score = evals_result['test']['logloss'][-1]
# print('best_score',best_score)
# test_train_result = [abs(evals_result['test']['logloss'][-1]),abs(evals_result['train']['logloss'][-1])]
# dratio_logloss = min(test_train_result)/max(test_train_result)
# if (accuracy_test/accuracy_train)*100.0 < 93.0:
# print(f' Trial Pruned test/train Accuracy : {(accuracy_test/accuracy_train)*100.0}')
# raise optuna.TrialPruned()
# preds = xgbc0.predict(dtest)
# pred_labels = np.rint(preds)
# accuracy = accuracy_score(y_train1, pred_labels)
# return accuracy
return evals_result['test']['auc'][-1]
pruner = optuna.pruners.MedianPruner(n_warmup_steps=5)
# pruner = optuna.pruners.HyperbandPruner()
study = optuna.create_study(direction="maximize",pruner=pruner,storage=storage_name,study_name=study_name)
study.optimize(objective, n_trials=100)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
print(" Number of estimators: {}".format(trial.user_attrs["n_estimators"]))
def get_sample_weight(y):
sw = [len(y)/(len(np.unique(y))*i) for i in np.bincount(y) ]
return np.array([sw[j] for j in y])
study_name="xgboost"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
trial = study.best_trial
best_params = study.best_params
dtrain = xgb.DMatrix(x_train, y_train,get_sample_weight(y_train))
dtest = xgb.DMatrix(x_train1, y_train1, get_sample_weight(y_train1))
dtest_unseen = xgb.DMatrix(x_test1, y_test1,get_sample_weight(y_test1) )
watchlist = [(dtest, 'test'), (dtrain, 'train')]
# watchlist = [(x_train1,y_train1), (x_train,y_train)]
param = {}
param['num_class']=3
param['objective']="multi:softprob"
param['eval_metric']="auc"
param['tree_method']="hist"
best_params['num_class']=3
best_params['objective']="multi:softprob"
best_params['eval_metric']="auc"
best_params['tree_method']="hist"
param['n_estimators']=trial.user_attrs["n_estimators"]
param['reg_alpha']=best_params['alpha']
param['reg_lambda']=best_params['lambda']
param['learning_rate']=best_params['eta']
param['booster']=best_params['booster']
param['grow_policy']=best_params['grow_policy']
param['max_depth']=best_params['max_depth']
param['gamma']=best_params['gamma']
xgbc0 = xgb.train(best_params, dtrain, evals=watchlist,num_boost_round=trial.user_attrs["n_estimators"])
yhat = xgbc0.predict(dtest_unseen)
ytrainhat = xgbc0.predict(dtrain)
yvalhat = xgbc0.predict(dtest)
# xgbc0 = xgb.XGBClassifier(**param)
# xgbc0.fit(x_train, y_train.to_numpy(),sample_weight=get_sample_weight(y_train),eval_set=watchlist)
# preds = xgbc0.predict_proba(x_test1)
def plot_sigbkg(class_,name):
# plt.figure(figsize=(6, 4))
plt.hist(yhat[:, class_][y_test1==class_],label='test sig',density=True,alpha=0.5)
plt.hist(yhat[:, class_][y_test1!=class_],label='test bkg',density=True,alpha=0.5)
plt.hist(ytrainhat[:, class_][y_train==class_],label='train sig',density=True,histtype='step')
plt.hist(ytrainhat[:, class_][y_train!=class_],label='train bkg',density=True,histtype='step')
counts,bin_edges = np.histogram(yvalhat[:, class_][y_train1==class_],density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.plot(bin_centers, counts,marker="o",linestyle="None",label="val sig")
counts,bin_edges = np.histogram(yvalhat[:, class_][y_train1!=class_],density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:])/2.
plt.plot(bin_centers, counts,marker="*",linestyle="None",label="val bkg")
plt.title(name)
plt.legend()
plt.tight_layout()
# plt.show()
def plot_roc(class_,name):
# plt.figure(figsize=(6,4))
roc1 = roc_curve(1*(y_test1==class_) , yhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'b',label=f'test (area = {auc(fpr,tpr)*100:.1f})%')
roc1 = roc_curve(1*(y_train==class_) , ytrainhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'r',label=f'train (area = {auc(fpr,tpr)*100:.1f})%')
roc1 = roc_curve(1*(y_train1==class_) , yvalhat[:, class_])
fpr,tpr,_=roc1
plt.plot(fpr, tpr, 'g',label=f'val (area = {auc(fpr,tpr)*100:.1f})%')
plt.legend()
plt.title(name)
# plt.show()
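plot_roc relies on sklearn's roc_curve and auc; as a cross-check, the same AUC can be computed directly as the Mann-Whitney rank statistic, i.e. the probability that a randomly chosen signal event scores above a randomly chosen background event (toy scores below are hypothetical):

```python
import numpy as np

def auc_rank(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # count wins (1.0) and ties (0.5) over all pos/neg pairs
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
a = auc_rank(y_true, scores)  # → 0.75 (3 of 4 pairs ranked correctly)
```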
plt.figure(figsize=(15,15))
plt.subplot(3,2,1)
plot_sigbkg(0,"GALAXY")
plt.subplot(3,2,2)
plot_roc(0,"GALAXY")
# plt.show()
plt.subplot(323)
plot_sigbkg(1,"QSO")
plt.subplot(324)
plot_roc(1,"QSO")
# plt.show()
plt.subplot(325)
plot_sigbkg(2,"STAR")
plt.subplot(326)
plot_roc(2,"STAR")
plt.show()
study_name="xgboost"
storage_name = "sqlite:///{}.db".format(study_name)
study = optuna.load_study(study_name=study_name, storage=storage_name)
plot_optimization_history(study).show()
plot_slice(study)
plot_contour(study,params=['alpha','eta'])
plot_contour(study,params=['alpha','lambda'])
plot_contour(study,params=['gamma','lambda'])
plot_parallel_coordinate(study)
Using the MSE loss function is much faster than the log-likelihood, so optimizing the MSE version of each technique is correspondingly faster: I could comfortably run 400 Optuna trials for stochastic gradient descent, but 40 trials for line search took more than 100x as long. None of the hand-made algorithms perform particularly well; most reach accuracies in the 70% to 75% range. I did not test sklearn's logistic regression; instead xgboost [4] was used, which gives usable results at ~95% accuracy.
[1] Abdurro’uf et al. (2022). The Seventeenth Data Release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 Data. Submitted to ApJS [arXiv:2112.02026]. https://www.sdss.org/dr17/
[2] fedesoriano (January 2022). Stellar Classification Dataset - SDSS17. Retrieved 15/10/2022 from https://www.kaggle.com/fedesoriano/stellar-classification-dataset-sdss17
[3] Wessam Salah Walid (September 2022). Stellar Classification - SDSS17 (4 ML Models). Retrieved 15/10/2022 from https://www.kaggle.com/code/wessamwalid/stellar-classification-sdss17-4-ml-models#Missing-Value-Analysis
[4] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785